Embedded System Design Dilemma


[Figure labels: compute performance; system time-to-market.]
Solving Compute Intensive Problems

• Processor with embedded programmable logic
— Lots of compute resources

• Utilize fabric by extending the ISA
— Design your own FU
— Add processor state

• Simple programming model
— Issue instructions to perform computation

• Tailored for high-throughput applications
— Compute intensive

[Diagram labels: cache, SRAM, MMU, logic, control.]
Compute Intensive Markets & Applications

• Sonar/Radar
• Biometrics
• Broadcast Equipment
• Audio Studio
• CAT Scan
• Ultrasound
• Encryption/Decryption
• Network Security
• Printers
• Scanners
EEMBC Telemark


[Bar chart: EEMBC Telemark score by processor; labels listed in chart order. Axes: Processor; EEMBC Telemark Score.]

Stretch S5000 - 300 MHz - C OPT
Intrinsity FastMATH - 2 GHz - ASM OPT
TI C6416 - 720 MHz - ASM OPT
TI C6416 - 720 MHz - C OPT
BOPS Manta v2.0 - - MHz - ASM OPT
BOPS Manta v2.0 - - MHz - C OPT
Motorola MPC7447 - 1.3 GHz - AltiVec OPT
Motorola MPC7455 - 1 GHz - AltiVec OPT
TI C6203 - 300 MHz - ASM OPT
TI C6203 - 300 MHz - C OPT
LSI 402ZX - 200 MHz - ASM OPT

Motorola MPC7455 - 1 GHz
TI TMS320C6416 - 720
IBM 440GX - 667 MHz
Intrinsity FastMATH - 2 GHz
PMC Sierra RM7000C - 625 MHz
Motorola PowerPC 7400 - 500
IBM 440GP - 500
IBM PowerPC 750CX - 500
MIPS 20Kc - 600 MHz
Motorola MPC755 - 400
IBM 440GP - 500
AMD K6-IIIE+ 550/ACR
Motorola MPC8245 - 400
AMD K6-2E+ 500/ACR
IBM 405GPr - 400 MHz
NEC VR5500 - 400
TI TMS320C6203 - 300
AMD K6-2E 400/ACR
Motorola PowerPC 603e - 300
AMD Au1100 - 396 MHz
NEC VR5500 - 300
Infineon Carmel 10xx - 170
IBM 405GPr - 266 MHz
Stretch S5000 - 300 MHz
NEC VR5000 - 250
LSI 402ZX - 200
IDT 79RC64575 - 250
Toshiba TMPR4927ATB - 200
Application Acceleration Process
RGB → YCbCr


Color Conversion Function:
SE_FUNC /* Tells Stretch C-Compiler to reduce this function to an instruction */
void RGB2YCBCR(WR A, WR *Out) {
    unsigned char Y[4], Cb[4], Cr[4];
    unsigned char R[4], G[4], B[4];
    int i;

    /* Unpack pixel 0 from the low 24 bits of A */
    R[0] = A(23,16); G[0] = A(15,8); B[0] = A(7,0);
    /* and so on... */

    for (i = 0; i < 4; i++) {
        Y[i]  = (77*R[i] + 150*G[i] + 29*B[i]) >> 8;
        /* Chroma is computed at half scale so that adjacent pairs can be summed below */
        Cb[i] = (32768 - 43*R[i] - 85*G[i] + (B[i] << 7)) >> 9;
        Cr[i] = (32768 + (R[i] << 7) - 107*G[i] - 21*B[i]) >> 9;
    }
    /* Pack 4 luma values and 2 chroma pairs into 8 bytes */
    *Out = (Y[3], Cr[3]+Cr[2], Y[2], Cb[3]+Cb[2], Y[1], Cr[1]+Cr[0], Y[0], Cb[1]+Cb[0]);
}

Program Loop:
for (...) {
    WRGET0(&A, 12);      /* Load 12 bytes (4 RGB pixels) */
    RGB2YCBCR(A, &B);    /* Convert 4 pixels */
    WRPUT0(B, 8);        /* Store 8 bytes (4 YCbCr pixels) */
}
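For comparison, the arithmetic inside RGB2YCBCR can be modelled in portable C. The sketch below is not Stretch C: the type rgb_pixel and the functions luma, half_cb, half_cr, and rgb4_to_ycbcr422 are hypothetical names, 8-bit unsigned channels are assumed, and the output byte order (least significant byte of the packed result first) is an assumption. It uses the same integer weights as the slide and sums half-scale chroma values from each pair of neighbouring pixels.

```c
#include <stdint.h>

/* One RGB pixel with 8-bit unsigned channels (an assumption of this sketch). */
typedef struct { uint8_t r, g, b; } rgb_pixel;

/* Luma: same integer weights as the slide (77 + 150 + 29 = 256). */
static uint8_t luma(rgb_pixel p) {
    return (uint8_t)((77 * p.r + 150 * p.g + 29 * p.b) >> 8);
}

/* Half-scale chroma terms; two neighbours are summed to form one sample. */
static uint8_t half_cb(rgb_pixel p) {
    return (uint8_t)((32768 - 43 * p.r - 85 * p.g + (p.b << 7)) >> 9);
}

static uint8_t half_cr(rgb_pixel p) {
    return (uint8_t)((32768 + (p.r << 7) - 107 * p.g - 21 * p.b) >> 9);
}

/* Convert 4 RGB pixels into 8 bytes. out[0] is taken to be the least
 * significant byte of the slide's packed result (assumed byte order). */
void rgb4_to_ycbcr422(const rgb_pixel in[4], uint8_t out[8]) {
    out[0] = (uint8_t)(half_cb(in[1]) + half_cb(in[0]));
    out[1] = luma(in[0]);
    out[2] = (uint8_t)(half_cr(in[1]) + half_cr(in[0]));
    out[3] = luma(in[1]);
    out[4] = (uint8_t)(half_cb(in[3]) + half_cb(in[2]));
    out[5] = luma(in[2]);
    out[6] = (uint8_t)(half_cr(in[3]) + half_cr(in[2]));
    out[7] = luma(in[3]);
}
```

On a scalar processor each call costs dozens of multiplies, shifts, and adds; the extension instruction folds the same work for four pixels into a single issue.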
© 2005 Stretch Inc. All rights reserved.


Programming Flow 


[Diagram: the designer annotates the application's C/C++ code; the Stretch C-Compiler turns the annotated code into an executable.]
Development & Debug

• Programmers develop & debug using familiar tools

• Faster debug cycle

• Port & accelerate apps within Stretch IDE

• No hardware design experience required!
[Screenshot: Stretch IDE debugging helloworld.c. Panes shown: source editor, Locals/Watch (including a WR variable, aval), Memory, Find Results/Output, Call Stack, Breakpoints, and Watchpoints.]


Stretch S5 Development Flow 


Tools:
• GNU-based C/C++ optimizing compiler
• Target simulator or chip
• GUI debugger

Flow:
Application code → Compile → Execute → Select hot-spot → Create or modify extension instructions → Modify application (and repeat)

Notes:
• One EI does the work of many operations and iterations
• Easy-to-use GUI
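How much the "select hot-spot" step pays off depends on how much of the run time the hot spot accounts for. A minimal sketch using Amdahl's law follows; the function name overall_speedup and the example fractions are hypothetical and are not taken from the slides.

```c
/* Overall speedup when a fraction `hot` of the run time (0.0 to 1.0) is
 * accelerated by `factor` and the remainder is unchanged (Amdahl's law). */
double overall_speedup(double hot, double factor) {
    return 1.0 / ((1.0 - hot) + hot / factor);
}
```

For example, accelerating a hot spot that takes 90% of the run time by 10x gives roughly a 5.3x overall speedup, which is why profiling comes before writing extension instructions.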
Under The Hood
S5 Datapath and Key Parameters


Wide Register File
• 32 128-bit entries

Load/store unit
• 128-bit load/store
• Auto increment/decrement
• Immediate, indirect, circular
• Variable-byte load/store
• Variable-bit load/store

Instruction Set Extension Fabric (ISEF)
• 3 inputs and 2 outputs
• Pipelined, interlocked
• 32 16-bit MACs and 256 ALUs
• Bit-sliced for arbitrary bit-width
• Internal processor state

RISC Processor
• Tensilica Xtensa V
• 32 KB I & D cache
• On-chip memory, MMU
• 24 channels of DMA, FPU

[Diagram labels: Wide Register File, 32-bit RF, Instruction Set Extension Fabric.]
Instruction Set Architecture

• Based on Xtensa ISA
• Single opcode space
— Different for each application
• Hardware "caches" EI subset
— # instructions not limited by hardware

[Diagram: opcode space per application. Always present: Xtensa V, WR loads/stores. Cached set: EIs for App 1, EIs for App 2.]
Matching Operating Frequency

• Two isochronous clock domains
— Processor runs at the higher frequency
— Programmable clock ratio (1:1 to 1:4)
— WR acts as gateway between them
— Constrains EI issue rate
• Clock ratio invisible to programmer
• Table-driven EI decode

[Diagram labels: Decoder; ISEF clock domain.]
S5 Pipeline

• Standard 5-stage RISC pipeline
• Extension instructions may be much longer
— Fully pipelined
• Compiler schedules all instructions
• Hardware interlocks ensure correctness
— Instruction groups subdivided by use/def requirements
— Interlocks driven by config table in Instruction Unit
Pipeline Schedule

• Extension instructions may have different latencies
• Initiation interval of an EI is a function of:
— WR and state use/def
• Optimal schedule typically has initiation interval > 1
— No performance penalty for lower clock rate!
• Statically scheduled
— Pipeline behavior is completely known at compile time

[Diagram: pipeline timing chart; details not recoverable.]
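The relationship between latency, initiation interval, and clock ratio can be sketched numerically. The model below is a simplification and an assumption, not Stretch's scheduler: it treats the ISEF, clocked at 1:ratio, as accepting at most one EI every `ratio` processor cycles, so a schedule whose initiation interval is already at least `ratio` loses nothing. The function names are hypothetical.

```c
/* Cycles to complete n back-to-back issues of one fully pipelined
 * instruction with the given latency and initiation interval. */
unsigned long pipelined_cycles(unsigned latency, unsigned interval,
                               unsigned long n) {
    if (n == 0) {
        return 0;
    }
    return latency + (n - 1) * interval;
}

/* Assumed model: the effective initiation interval, in processor cycles,
 * is the larger of the schedule's interval and the clock ratio. */
unsigned effective_interval(unsigned schedule_interval, unsigned clock_ratio) {
    return schedule_interval > clock_ratio ? schedule_interval : clock_ratio;
}
```

Under this model a schedule with an initiation interval of 3 runs just as fast at a 1:2 clock ratio as at 1:1, which matches the slide's claim that a lower ISEF clock need not cost performance.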
ISEF Architecture

• Array of 64-bit ALUs
— May be broken at 4-bit or 1-bit boundaries
— Conditional ALU operation: Y = C ? (A op1 B) : (A op2 B)
— Registers to implement state, pipelining
• Array of 4x8 multipliers, cascadable to 32x32
— Programmable pipelining
• Programmable routing fabric
• Execution clock divided from processor clock: 1:1, 1:2, 1:3, 1:4
• Up to 16 instructions per ISEF configuration
• Many configurations per executable
• Reconfiguration
— User directed or on demand
— ~100 µs for complete reconfiguration
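Two of the ALU properties listed above can be illustrated in plain C. This is a software model only, with hypothetical function names: add_4bit_lanes shows what "broken at 4-bit boundaries" means for an add (carries stop at each lane boundary), and cond_alu shows the conditional form with add and subtract chosen arbitrarily as the two operations.

```c
#include <stdint.h>

/* Add two 64-bit words as sixteen independent 4-bit lanes:
 * a carry out of one lane never reaches the next lane. */
uint64_t add_4bit_lanes(uint64_t a, uint64_t b) {
    const uint64_t top = 0x8888888888888888ULL; /* top bit of every lane */
    /* Add the low 3 bits of each lane; any carry lands in the lane's top bit. */
    uint64_t low = (a & ~top) + (b & ~top);
    /* Fold in the original top bits with XOR, so nothing carries out. */
    return low ^ ((a ^ b) & top);
}

/* Y = C ? (A op1 B) : (A op2 B), with op1 = add and op2 = subtract
 * picked purely for illustration. */
uint64_t cond_alu(int c, uint64_t a, uint64_t b) {
    return c ? (a + b) : (a - b);
}
```

In the lane-wise add, 0xF + 0x1 wraps to 0x0 within its own lane instead of carrying into the neighbouring one.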


[Diagram: ISEF blocks, each containing a multiplier array and an ALU array.]
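Cascading small multipliers into a wide one is ordinary partial-product arithmetic. The sketch below models it in software, assuming one operand is split into 4-bit digits and the other into 8-bit digits; how the real fabric wires its multipliers is not described in the slides, and the function names are hypothetical.

```c
#include <stdint.h>

/* One 4x8 multiplier: 4-bit operand times 8-bit operand, 12-bit product. */
static uint16_t mul4x8(uint8_t a4, uint8_t b8) {
    return (uint16_t)((a4 & 0xF) * b8);
}

/* 32x32 multiply built from 8 x 4 = 32 partial products of 4x8 multipliers.
 * Each partial product is shifted to the weight of its two digits. */
uint64_t mul32x32_cascaded(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {                 /* 4-bit digits of a */
        uint8_t nibble = (uint8_t)((a >> (4 * i)) & 0xF);
        for (int j = 0; j < 4; j++) {             /* 8-bit digits of b */
            uint8_t byte = (uint8_t)((b >> (8 * j)) & 0xFF);
            acc += (uint64_t)mul4x8(nibble, byte) << (4 * i + 8 * j);
        }
    }
    return acc;
}
```

Narrower products such as 16x16 need only a subset of these partial products, which is one way to read "bit-sliced for arbitrary bit-width".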
Dynamic ISEF Configuration

Two ISEFs
• Each holds up to 16 instructions
• Configured via system bus
• Managed by DMA
• Independently configurable

On-Demand Configuration
• Auto managed by HW and OS
• Handled like cache miss

Configuration Preloading
• Managed by application
• Handled like instruction pre-fetch

[Diagram: Wide Register File and two ISEFs, each with a config port on the system bus; timelines of task and config activity for on-demand configuration and configuration preloading.]
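The cache-miss and pre-fetch analogies can be made concrete with a toy model of two ISEF slots. Everything here is hypothetical: the type isef_model, the function names, and the eviction rule (replace the ISEF that was not used most recently) are assumptions for illustration, not the behaviour of the Stretch hardware or OS.

```c
#define ISEF_COUNT 2
#define NO_CONFIG  (-1)

typedef struct {
    int loaded[ISEF_COUNT]; /* configuration ID held by each ISEF */
    int last_used;          /* index of the most recently used ISEF */
    unsigned reconfigs;     /* on-demand reconfigurations (stalls) so far */
} isef_model;

void isef_init(isef_model *m) {
    for (int i = 0; i < ISEF_COUNT; i++) {
        m->loaded[i] = NO_CONFIG;
    }
    m->last_used = 0;
    m->reconfigs = 0;
}

/* Return the ISEF holding cfg, or -1 if it is not loaded. */
static int isef_find(const isef_model *m, int cfg) {
    for (int i = 0; i < ISEF_COUNT; i++) {
        if (m->loaded[i] == cfg) {
            return i;
        }
    }
    return -1;
}

/* On-demand: executing an EI from an unloaded configuration behaves like a
 * cache miss. Returns 1 if a reconfiguration stall occurred, else 0. */
int isef_execute(isef_model *m, int cfg) {
    int slot = isef_find(m, cfg);
    if (slot < 0) {
        slot = 1 - m->last_used;   /* assumed policy: evict the other ISEF */
        m->loaded[slot] = cfg;
        m->reconfigs++;
        m->last_used = slot;
        return 1;
    }
    m->last_used = slot;
    return 0;
}

/* Preloading: the application loads cfg into the idle ISEF ahead of use,
 * like an instruction pre-fetch, so a later execute does not stall. */
void isef_preload(isef_model *m, int cfg) {
    if (isef_find(m, cfg) < 0) {
        m->loaded[1 - m->last_used] = cfg;
    }
}
```

With a complete reconfiguration costing on the order of 100 µs, each avoided stall matters; in this model, alternating between two configurations stalls only twice, and a preload issued while the other ISEF is busy hides the third configuration's load.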
ISEF Bitstream Generation




C Compiler Tradeoffs 


[Diagram: the Stretch C-Compiler takes the application and produces two outputs.
Executable path: global optimization, register allocation, loop optimization, instruction scheduling, code gen → executable.
ISEF path: synthesis, P&R, pipelining & balancing, static timing verification → ISEF bit stream.
Timing information shared between the two paths: operand timing, state timing, issue rate.]
Examples of Extension Instructions
RGB2YCbCr: SCP Instruction Efficiency

[Diagram labels: 96 bits; scalar processor.]


Flexibility in FIR Implementations 
[Diagram labels: wide loads & stores; compute array.]
Sample Applications
H.264 Encoder Challenges

[Block diagram: H.264 encoder with compute-intensive tasks marked. Blocks: source picture, forward transform, quantization, select, ME, pred, inverse quantization, inverse transform, deblocking filter.]
H.264 Encode

[Chart: original C vs. accelerated C; SD, 30 fps, Main Profile. Stages: intra-frame/MC, Q, de-blocking filter, motion estimation, rest. Labeled value: 1150%.]

• Encoding acceleration
• …mizes multiple-DSP designs
• … man-months to port algorithms in C/C++
802.16d (WiMAX) Challenges

[Block diagram: 802.16d PHY and MAC processing, with compute-intensive tasks marked.
PHY: RF, FIR, frequency/time correction, frequency-domain equalizer, de-interleave, soft demapper, FEC (Viterbi), de-randomize.
MAC: parser, decryption, message handler, IP stack, wireless stack, router.]
802.16d Transmit (PHY)

[Chart: original C vs. accelerated C; 3.5 MHz, 64 QAM, 3/4 rate; TDD 50% transmit, 50% receive. Stages: Reed-Solomon, convolutional encoder, QAM mapper, rest. Labeled values: 3416%, 3000%, 8%.]

• Complex WiMAX MAC & PHY processing
• Single-chip acceleration replaces DSPs & FPGAs
• … man-months to port algorithms in C/C++
802.16d Receiver (PHY)

[Chart: original C vs. accelerated C; 3.5 MHz, 64 QAM, 3/4 rate; TDD 50% transmit, 50% receive. Stages as labeled: Reed-Solomon encoder, Viterbi decoder, QAM mapper, IFFT, rest. Labeled values: 2416%, 2000%.]

• Complex WiMAX MAC & PHY processing
• Single-chip acceleration replaces DSPs & FPGAs
Summary

• C/C++ programmers can tailor the processor for their application
— Continue to develop & debug in a familiar environment (IDE/gdb)

• Application-specific instructions provide 10-100X speedup
— Modest porting effort

• C/C++ programmers can easily design high-performance systems
— Enabling a new system design methodology